Abstract

Background: Xor-genotype is a cost-effective alternative to the genotype sequence of an individual. Recent 
methods developed for haplotype inference have aimed at finding the solution based on xor-genotype data. Given 
the xor-genotypes of a group of unrelated individuals, it is possible to infer the haplotype pairs for each individual 
with the aid of a small number of regular genotypes. 

Results: We propose a framework of maximum parsimony inference of haplotypes based on the search of a sparse 
dictionary, and we present a greedy method that can effectively infer the haplotype pairs given a set of xor-genotypes 
augmented by a small number of regular genotypes. We test the performance of the proposed approach on synthetic 
data sets with different numbers of individuals and SNPs, and compare its performance with the state-of-the-art 
xor-haplotyping methods PPXH and XOR-HAPLOGEN. 

Conclusions: Experimental results show good inference qualities for the proposed method under all circumstances, 
especially on large data sets. Results on a real database, GFTR, also demonstrate significantly better performance. 
The proposed algorithm is also capable of finding accurate solutions with missing data and/or typing errors. 



Background 

A human genome is a sequence of nucleotides that can 
differ from one individual to another (approximately 
0.1% difference between any two individuals) for various 
reasons, such as insertions/deletions of portions 
of the sequence or, most commonly, the substitution/mutation 
of single nucleotides at commonly 
observed sites called single nucleotide polymorphisms 
(SNPs) [1]. In most SNPs only two different nucleotides 
are observed out of 4 nucleotides. The information of 
nucleotide variations extracted from these SNP sites (loci) 
is encoded as a sequence called "haplotype". That is, for 
a particular SNP site a notation is used for one of the 
observed nucleotides (e.g., the most commonly observed 
nucleotide variant - dominant/major allele) and another 
notation is used for the other (e.g., the least observed 
nucleotide variant - recessive/minor allele). Because of 
their informative and hereditary nature, identifying the 
haplotypes of individuals has been an important subject in 
various medical and scientific studies, such as gene-related 
disease discovery and drug design [2,3] and population 
history research [4]. Nonetheless, current experimental 
techniques are not cheap or efficient enough for 
directly sequencing the haplotypes of an individual; identifying 
them therefore relies mostly on indirect approaches, 
e.g., using computational methods to infer haplotypes 
from alternative, cost-effective data called "genotype". 

The entire human genome consists of 23 distinct chro- 
mosomes each appearing in two copies (autosomes) 
except for the chromosome-23 (allosome) which con- 
sists of two copies of chromosome-X in females or 
one chromosome-X and one chromosome-Y in males. 
Each chromosome is a pair of two distinct sequences - 
haplotypes- inherited from the parents, i.e., one is from 
the maternal genome and the other is from the paternal 
genome. The genotype is sequenced by identifying 
the types of alleles (nucleotide variants) across the SNP 
locations (loci) in chromosomes. In a particular locus 
of a chromosome if both haplotypes have the same allele 
we call this site in the genotype homozygous and denote 
it with the type of alleles in both haplotypes as either 
common-type or wild-type; otherwise, if the haplotypes 
have different alleles (one common-type and one 
wild-type) we call this site heterozygous. When identifying 
haplotypes for a given genotype, ambiguity arises 
at the heterozygous sites, since there is no information 
about which haplotype has the common-type allele 
and which has the wild-type allele. Clearly, 
genotypes are less informative than haplotypes, as they 
are ambiguous at heterozygous sites due to the possible 
permutations, and computational methods can be 
employed to identify which allele comes from which haplotype. 
Recently, more cost-effective alternative methods 
have been used for genotype sequencing [5], e.g., widely 
used denaturing high-performance liquid chromatography 
(DHPLC) [6]. With certain applications of such methods one 
can only determine whether an individual is homozygous 
or heterozygous at a given SNP site, but cannot 
distinguish the type of allele at homozygous sites. The 
sequenced data is thereby less informative than the reg- 
ular genotypes as it only represents the differing sites 
(XOR operation) between the haplotypes. This less infor- 
mative form of genotype is named xor-genotype. One can 
solve the haplotype inference problem based on the xor- 
genotypes, i.e., xor-haplotyping, with a reasonable extra 
computational effort. 

Methods for solving the haplotype inference problem 
given the regular genotypes can be summarized in two 
categories: combinatorial methods that usually state an 
explicit objective function and propose methods for optimizing 
it, and statistical methods that rely on statistical 
modeling of the problem. Various methods have been 
published for the haplotype inference problem [7-13]; 
however, the xor-haplotyping problem has remained largely 
under-investigated. Two particular methods are suitable 
for xor-haplotyping problems: parsimony haplotyping 
that is based on the maximum parsimony principle, and 
perfect phylogeny haplotyping that relies on a population 
genetics assumption called the infinite sites/alleles model 
[14], i.e., it assumes that allele sequences are long enough 
so that a particular allele will have a mutation only once in 
the phylogenetic tree. The perfect phylogeny (PP) model 
utilizes the infinite sites assumption by building a tree 
of individuals -haplotypes- where all individuals evolve, 
with no recurrent mutation, from one common ancestor. 
An approximate solution to the xor-haplotyping problem 
under the PP model was introduced in [15], where 
xor-haplotype inference was cast as a graph realization 
problem [16,17]. However, the method proposed 
in [15] (GREAL) is not well suited to xor-genotypes 
with a large number of SNPs (it is usually limited to 30 SNPs 
[18]) and has not been extended to missing data cases. 

On the other hand, it is known that in a population of 
individuals certain haplotypes are frequently found in cer- 
tain genomic regions [19]. This fact leads to the parsimony 
principle, which states that the genotypes of a population of 
individuals are generated by the smallest number of distinct 
haplotypes. Identifying such a smallest set of haplotypes 
is called the Pure (Maximum) Parsimony Problem, which is 
NP-hard [20]. An integer linear programming method that 
finds a pure parsimony solution to this problem was 
introduced in [21], and in [22] a branch-and-bound method was 
used to solve the pure parsimony problem. In [23] a method 
called XOR-HAPLOGEN was proposed for solving the 
haplotype inference problem in the case of xor-genotype 
data. This method can find accurate solutions for 
xor-genotypes with a large number of SNPs. Another parsimony 
method, pure parsimony xor haplotyping (PPXH), was introduced in [24]; 
it represents xor-haplotype inference as a graph realization problem. 

In [25] a novel framework for (regular) haplotyping was 
proposed by interpreting the parsimony principle as a 
sparse representation of the genotypes. Two approaches 
are presented: maximizing a sparseness condition on the 
haplotype frequency vector determined by the inferred 
haplotypes, and casting the sparsity of this frequency 
vector as a sparse dictionary selection problem. The lat- 
ter approach is based on an efficient greedy method 
SHSD where haplotypes explaining the given genotypes 
are determined according to a sparse selection from the 
set of compatible haplotypes. The method constructively 
determines the solution for each individual while selecting 
the haplotypes from this set, and it comes with a convergence 
guarantee. 

For the xor-haplotyping problem, there is an increased 
ambiguity due to the XOR operation between haplo- 
types, i.e., the process of xor-genotyping that determines 
whether the type of alleles in both haplotypes differ in 
a particular site (heterozygous) or they are the same 
(homozygous). However, this ambiguity can be resolved 
with the assistance of regular genotypes. Regular geno- 
types can either be used as post-processing inputs for 
eliminating set-equivalent solutions of a particular infer- 
ence, or they can be used to refine inference while con- 
structing the solution. 

Tractability of the maximum parsimony haplotyping 
problem in the xor-genotype case is still open [24]. In 
this paper, we propose a modified version of SHSD, 
called XHSD, that can efficiently find a solution to the maximum 
parsimony xor-haplotyping problem and resolve the 
ambiguity with the help of a small number of regular 
genotypes. For a given set of xor-genotypes, the haplotype 
pairs for each individual are selected from the set 
of compatible haplotypes by a sparse dictionary selection 
method. The selection of dictionary columns from the set 
of compatible haplotypes and the sparse representation of 
the xor-genotypes are formulated as a joint combinatorial 
optimization problem. The objective function of this problem 
maximizes a variance reduction metric over all individuals. 
Our algorithm is a low-complexity greedy method 
that terminates once the solution is fully determined. To 
resolve the ambiguity and to improve the inference accuracy, 
we employ a small number of regular genotypes as 
constraints on the set of compatible haplotypes to help 
resolve the type of homozygous alleles. 

The remainder of the paper is organized as follows. In 
Preliminaries, we introduce the xor-haplotype inference 
problem. In Methods, we formulate the xor-haplotype 
inference as a sparse dictionary selection problem and 
present an efficient greedy method for solving this prob- 
lem. We also discuss the use of regular genotypes to 
resolve ambiguity. In the Extensions section, we discuss how 
the algorithm deals with long sequences and data with 
missing sites. In Results and discussion, we present the 
experimental results on synthetic and real data sets under 
various conditions. Finally, the Conclusions section closes 
the paper. 

Preliminaries 

In an SNP locus only 2 nucleotides are observed, and a 
single bit is sufficient for the representation of nucleotide 
variants such that 0 encodes the major allele and 1 encodes 
the minor allele. The haplotype of an individual can 
thereby be represented with a binary vector that shows 
the SNP variants across the individual's chromosome. The 
genotype can then be thought of as a ternary vector where 
a 0 (2) indicates that the site is homozygous and both 
haplotypes have major 0/0 (minor 1/1) alleles, and 1 indi- 
cates that the site is heterozygous and the haplotypes have 
different alleles 0/1 or 1/0. Notice that when encoding 
homozygous and heterozygous sites we used a different 
notation from the literature in order to express a genotype 
vector as the sum of two haplotypes: a minor-homozygous 
SNP is encoded with 2 and a heterozygous SNP is encoded 
with 1, so that a 2 in the genotype is given by (the sum of) 
two minor alleles, and a 1 in the genotype is given by (the 
sum of) one major and one minor allele. 
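
The encoding described above can be illustrated with a short Python sketch. The function names `genotype` and `xor_genotype` and the toy haplotypes are assumptions made here for illustration only; they are not part of the original method.

```python
import numpy as np

def genotype(h1, h2):
    """Regular genotype as the element-wise sum of two binary haplotypes.

    0 = homozygous major (0/0), 1 = heterozygous, 2 = homozygous minor (1/1).
    """
    return np.asarray(h1) + np.asarray(h2)

def xor_genotype(h1, h2):
    """Xor-genotype: 1 at heterozygous sites, 0 at homozygous sites."""
    return np.asarray(h1) ^ np.asarray(h2)

# Toy haplotype pair over four SNPs (hypothetical values).
h1 = [0, 1, 1, 0]
h2 = [0, 1, 0, 1]

g = genotype(h1, h2)      # sum encoding used in the text
x = xor_genotype(h1, h2)  # loses the allele type at homozygous sites
```

With this encoding the xor-genotype is simply the regular genotype taken modulo 2, which is why the two minor alleles at the second site of `h1` and `h2` are no longer distinguishable from two major alleles in `x`.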

In general, given a length-$L$ genotype vector, $k \le L$ of 
the loci are heterozygous and thereby ambiguous: at each 
of the $k$ sites one haplotype can take two values (0 or 
1) and the other haplotype takes the complementary value. 
Considering all $k$ heterozygous sites, one haplotype can 
then be one of $2^k$ binary sequences, and the other haplotype 
will be the complement (inverted values) of that 
sequence at those sites. Therefore, for solving a genotype with $k$ 
heterozygous sites, the pair of haplotypes is drawn from a set 
of $2^k$ distinct binary vectors of length $L$. 
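
As a minimal sketch of this counting argument, the following Python snippet enumerates the candidate haplotypes of a regular genotype under the sum encoding. The helper name `compatible_pairs` and the example genotype are hypothetical and introduced only for illustration.

```python
from itertools import product

def compatible_pairs(g):
    """All ordered haplotype pairs (h1, h2) with h1 + h2 == g, site by site."""
    het = [l for l, s in enumerate(g) if s == 1]   # heterozygous (ambiguous) sites
    pairs = []
    for bits in product([0, 1], repeat=len(het)):
        # Homozygous sites are fixed: 0 -> allele 0, 2 -> allele 1.
        h1 = [s // 2 for s in g]
        # Heterozygous sites take every 0/1 assignment in turn.
        for l, b in zip(het, bits):
            h1[l] = b
        # The second haplotype is whatever remains of the sum.
        h2 = [s - a for s, a in zip(g, h1)]
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

pairs = compatible_pairs([0, 2, 1, 1])   # k = 2 heterozygous sites
```

For this genotype the first haplotype ranges over $2^2 = 4$ candidates; since swapping the two haplotypes gives the same solution, these correspond to two unordered pairs.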

On the other hand, in the xor-haplotyping problem the conflated 
data (the xor-genotype) is less informative than 
the regular genotype because the information 
about the type of allele at homozygous sites is lost. The 
xor-genotype is itself a binary vector given by the XOR sum 
of two haplotypes. For a given site, 1 indicates a heterozygous 
SNP where the two haplotypes have different alleles; 
likewise, 0 indicates a homozygous SNP 
where the haplotypes have the same allele, but without 
any distinction as to whether that allele is major or 
minor. That is, the xor-genotype tells us 
whether a particular SNP site is homozygous, but 
the type of allele at those homozygous sites is not identified. 
Every site of an xor-genotype is ambiguous, and each 
site of the corresponding haplotype can take two values. 
Therefore, a length-$L$ xor-genotype can be explained by 
a pair of haplotypes drawn from a set of $2^L$ distinct 
binary vectors of length $L$. Hence, because of the 
additional ambiguity at homozygous sites, the number of 
possible solutions for an xor-genotype is significantly (in 
fact, exponentially) larger than that for a regular genotype 
of the same size. 

Besides being NP-hard, the xor-haplotyping problem 
also has no unique solution. The nature of 
the XOR operation results in a phenomenon called the bit 
flip degree of freedom [15], i.e., for a particular solution 
set $H$ consisting of length-$L$ haplotypes, one can produce 
equivalent solution sets by inverting a certain SNP $\ell \le L$ 
(or a set of SNPs $S \subseteq \{1,\ldots,L\}$) in all haplotypes of 
$H$. Notice that inverting (complementing) an SNP across 
all haplotypes has no effect on the xor-genotypes they 
generate, because the alleles explaining homozygous 
sites of xor-genotypes are not distinguished (hidden). 
More specifically, assume that $h_i^1(\ell) \in \{0,1\}$ and $h_i^2(\ell) \in 
\{0,1\}$ represent the haplotypes of the $i$-th individual at the 
$\ell$-th SNP and that they generate that individual's xor-genotype 
$x_i(\ell)$ such that $x_i(\ell) = h_i^1(\ell) \oplus h_i^2(\ell)$. Then the complemented 
SNPs of the haplotypes also explain that SNP of the 
same xor-genotype, i.e., $x_i(\ell) = \bar{h}_i^1(\ell) \oplus \bar{h}_i^2(\ell)$. It then 
follows that for a particular set $H$ of length-$L$ haplotypes 
that solves a given set of xor-genotypes, there are at most 
$\sum_{k=1}^{L} \binom{L}{k} = 2^L - 1$ equivalent sets $H'_i$, $i = 1,\ldots,2^L - 1$, to 
$H$, where each equivalent set $H'_i$ also solves that given set 
of xor-genotypes. 
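
The bit flip degree of freedom can be checked numerically. The sketch below is illustrative only: the helper names `xor_genotypes` and `flip_snps`, the toy haplotype set and the pairing of haplotypes to individuals are all assumptions, not data from this study.

```python
import itertools
import numpy as np

def xor_genotypes(H, pairs):
    """Xor-genotype of each individual from its pair of haplotype rows in H."""
    return np.array([H[a] ^ H[b] for a, b in pairs])

def flip_snps(H, snps):
    """Complement the given SNP columns in every haplotype of H."""
    flipped = H.copy()
    flipped[:, list(snps)] ^= 1
    return flipped

# Toy solution: three haplotypes (rows) over L = 3 SNPs, and the
# haplotype pair assigned to each of three individuals.
H = np.array([[0, 0, 1], [0, 1, 0], [0, 0, 0]])
pairs = [(0, 2), (1, 2), (0, 1)]
X = xor_genotypes(H, pairs)

L = H.shape[1]
# Every non-empty subset of SNPs: 2^L - 1 possible bit flips.
subsets = [s for k in range(1, L + 1) for s in itertools.combinations(range(L), k)]
equivalent = [flip_snps(H, s) for s in subsets]
all_same = all(np.array_equal(xor_genotypes(E, pairs), X) for E in equivalent)
```

Each of the $2^3 - 1 = 7$ flipped sets reproduces exactly the same xor-genotypes, so the xor data alone cannot distinguish between them.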

Problem definition 

For each SNP $\ell = 1,\ldots,L$, the xor-genotype is given by 
the XOR-sum of two haplotypes such that 

$$x_i(\ell) = h_i^1(\ell) \oplus h_i^2(\ell), \quad \ell = 1,\ldots,L, \tag{1}$$

where $x_i(\ell) \in \{0,1\}$ is the xor-genotype of the $i$-th individual 
at SNP $\ell$, and $h_i^j(\ell) \in \{0,1\}$ is the $j$-th haplotype 
of the $i$-th individual at SNP $\ell$. Let $x_i = [x_i(1) \ldots x_i(L)]^T$ 
be the xor-genotype of the $i$-th individual; then (1) can be 
written as 

$$x_i = h_i^1 \oplus h_i^2, \tag{2}$$

where $h_i^j = [h_i^j(1) \ldots h_i^j(L)]^T$ is the $j$-th ($j = 1, 2$) haplotype 
of the $i$-th individual consisting of $L$ SNPs. In this 
representation, we say that the xor-genotype of the $i$-th 
individual, $x_i$, is phased by the haplotype pair $\{h_i^1, h_i^2\}$. 

In regular haplotyping, a putative haplotype $z \in \{0,1\}^L$ 
is called compatible with a genotype $g \in \{0,1,2\}^L$ if 
$(g - z) \in \{0,1\}^L$, and such a haplotype is a possible 
solution that can explain that genotype. That is, the haplotype 
pair $\{z, (g - z)\}$ is one of the possible solutions to 
the genotype $g$. Therefore, for every given genotype $g_i$ it 
is essential to determine a set of compatible haplotypes 
$H_i$ when searching for possible solutions. The union of 
the sets $H_1,\ldots,H_N$ for $N$ individuals forms the matrix 
$Z \in \{0,1\}^{L \times M}$, where $M$ is the total number of distinct 
compatible haplotypes. 

In xor-haplotyping, on the other hand, it is easy to see 
that any haplotype $z \in \{0,1\}^L$ is compatible (consistent) 
with any xor-genotype $x$, i.e., $x = z \oplus z'$, since there 
always exists a haplotype $z' \in \{0,1\}^L$ such that $z' = x \oplus z$. 
Therefore, the set of compatible haplotypes $H_i$ for a given 
length-$L$ xor-genotype $x_i$ consists of all binary vectors of 
length $L$, i.e., $H_1 = H_2 = \cdots = H_N = \{0,1\}^{L \times 2^L} \triangleq Z$. 

Because of this compatibility between the xor-genotypes 
and candidate haplotypes, an SNP site can always be 
explained by either of the two alleles, and thus unambiguous 
SNPs no longer exist. Notice that, in particular, 
an xor-genotype with all-homozygous SNPs is still 
ambiguous and still needs to be solved, up to bit flipping. 
However, we know that such an xor-genotype is always 
explained by a pair of identical haplotypes, which correspond 
to the same column of $Z$. On the other hand, if there 
is at least one heterozygous SNP in the xor-genotype, then 
its phasing haplotypes are not identical and correspond to 
different columns of $Z$. 

The xor-genotype of the $i$-th individual is expressed as 

$$x_i = (Z v_i)_2, \tag{3}$$

where $(\cdot)_2$ represents the component-wise modulo-2 operation, 
and $v_i \in \{0,1,2\}^M$, $\mathbf{1}^T v_i = 2$, is the sparse 
vector indicating the haplotype locations as the indices 
of the columns of the matrix $Z$ of consistent haplotypes. Notice that 
the modulo-2 operation in (3) is equivalent to the XOR 
operation between the two haplotypes selected by $v_i$. 
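
The indicator-vector representation in (3) can be sketched as follows. The column ordering of `Z`, the helper name `indicator` and the 0-based column indices are illustrative assumptions; any ordering of the $2^L$ binary vectors would serve equally well.

```python
import numpy as np

L = 3
# Z holds every length-L binary vector as a column (L x 2^L);
# entry (l, j) is bit l of the integer j.
Z = (np.arange(2 ** L)[None, :] >> np.arange(L)[:, None]) & 1

def indicator(M, m, n):
    """Sparse vector v in {0,1,2}^M selecting columns m and n (1^T v = 2)."""
    v = np.zeros(M, dtype=int)
    v[m] += 1
    v[n] += 1          # m == n gives a single entry equal to 2
    return v

v_het = indicator(Z.shape[1], 4, 2)   # two different haplotypes
x_het = (Z @ v_het) % 2               # same as Z[:, 4] ^ Z[:, 2]

v_hom = indicator(Z.shape[1], 5, 5)   # the same haplotype twice
x_hom = (Z @ v_hom) % 2               # all-homozygous xor-genotype
```

Selecting one column twice always yields the all-zero xor-genotype, which matches the observation that an all-homozygous xor-genotype is explained by a pair of identical haplotypes.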

Given $Z$, finding the indicator vector $v_i$ for an individual 
is equivalent to inferring its haplotype pair $\{h_i^1, h_i^2\}$. 
The maximum parsimony principle suggests that a given 
set of xor-genotypes should be explained by the smallest 
number of distinct haplotypes. Therefore, given the set 
of xor-genotypes for $N$ individuals $\{x_i, i = 1,\ldots,N\}$, 
one needs to infer the haplotype pairs for each individual, 
$\{h_i^1, h_i^2, i = 1,\ldots,N\}$, so that the union of all inferred haplotypes 
forms the smallest possible set. In other words, 
the xor-haplotyping problem is to infer $v_i$, $i = 1,\ldots,N$, 
given $Z$ while selecting as few columns of $Z$ as possible. 



Methods 

Xor Haplotyping by Sparse Dictionary Selection (XHSD) 

If an (all-homozygous) xor-genotype is explained by only 
one haplotype, i.e., $x_i = h^s \oplus h^s$, where the haplotype $h^s$ is 
the $s$-th column of $Z$, then the indicator vector multiplies 
that haplotype by 2, i.e., $v(s) = 2$ and $v(j) = 0$ for $j \in 
\{1,\ldots,2^L\} \setminus \{s\}$. Otherwise, if the xor-genotype is explained 
by two different haplotypes, $x_i = h^m \oplus h^n$, $m \ne n$, then 
they are indicated by the vector $v$ such that $v(m) = v(n) = 
1$ and $v(j) = 0$ for $j \in \{1,\ldots,2^L\} \setminus \{m, n\}$. Hence, we can 
rewrite (3) in the following more compact form: 

$$x_i = (Z_{A_i} \tilde{v}_i)_2, \tag{4}$$

where $A_i$ is the set of indices corresponding to the nonzero 
elements of $v_i$, $Z_{A_i}$ is the submatrix of $Z$ consisting of the 
columns indexed by $A_i$, and $\tilde{v}_i$ collects the non-zero elements 
of $v_i$. 

For each observed xor-genotype $x_i$, the phasing haplotypes 
are located in the columns of $Z$ indexed by $A_i$. The 
union of these column indices, i.e., $\mathcal{D} = \cup_i A_i$, forms 
the dictionary of the haplotypes that suffices to construct 
all given xor-genotypes. The maximum parsimony principle 
then dictates that the dictionary $\mathcal{D}$ should contain 
the smallest possible number of elements that can reconstruct 
all observed xor-genotypes. The set of haplotypes indicated 
by such a sparse dictionary $\mathcal{D}$ is given by $H = Z_{\mathcal{D}}$, 
where $Z_{\mathcal{D}}$ is the submatrix of $Z$ consisting of the columns 
indexed by $\mathcal{D}$. Then $H$ is a solution set to the maximum 
parsimony haplotyping problem for the given set of 
xor-genotypes $\{x_i, i = 1,\ldots,N\}$. 

To solve the xor-haplotyping problem, we choose the 
sparse dictionary $\mathcal{D}$ to minimize the average distance 
between the observed xor-genotypes and the closest 
approximations constructed by the haplotypes in $Z_{\mathcal{D}}$. 
Since there is no prior information about the dictionary 
$\mathcal{D}$ and the indices $A_i$ for proper reconstruction of each 
xor-genotype, determining $\mathcal{D}$ and $A_i$ leads to a combinatorial 
problem. This joint optimization problem can be 
solved efficiently by a greedy method that we explain 
next. 

For an observed xor-genotype, the reconstruction accuracy 
can be interpreted as the squared Euclidean distance between 
the observation and its closest approximation, i.e., 

$$L_i(A) = \min_{v_i} \| x_i - (Z_A v_i)_2 \|^2, \tag{5}$$

where $A$ represents the indices of haplotypes in $Z$ used to 
approximate $x_i$. Notice that an exact solution will satisfy 
$L_i(A) = 0$. For a given dictionary $\mathcal{D}$, the indices $A_i$ for 
reconstructing each xor-genotype will be determined by 
restricting $A_i$ to be a subset of $\mathcal{D}$ such that 

$$A_i = \arg\min_{A \subseteq \mathcal{D},\, |A| \le 2} L_i(A). \tag{6}$$

The individual cost function in (5) is then translated into a 
fitness function associated with a given dictionary $\mathcal{D}$, i.e., 

$$F_i(\mathcal{D}) = \|x_i\|^2 - \min_{A \subseteq \mathcal{D},\, |A| \le 2} L_i(A). \tag{7}$$

Finally, the fitness value of $\mathcal{D}$ is averaged over all individuals 
to measure the overall reconstruction accuracy: 

$$F(\mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} F_i(\mathcal{D}).$$
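
The cost and fitness functions in (5)-(7) can be written down directly for small $L$. The sketch below is an illustration under stated assumptions: the function names, the column ordering of `Z`, the 0-based indices and the convention that an empty dictionary reconstructs the all-zero vector are choices made here, not details taken from the original implementation.

```python
import itertools
import numpy as np

def all_haplotypes(L):
    """Z: every length-L binary vector as a column (L x 2^L)."""
    return (np.arange(2 ** L)[None, :] >> np.arange(L)[:, None]) & 1

def min_loss(x, Z, D):
    """Minimum of L_i(A) over A within D with |A| <= 2, as in (5)-(6)."""
    x = np.asarray(x)
    # A pair may repeat a column, which reconstructs the all-zero vector.
    costs = [int(np.sum((x - (Z[:, a] ^ Z[:, b])) ** 2))
             for a, b in itertools.combinations_with_replacement(D, 2)]
    # Assumption: with an empty dictionary the reconstruction is all zeros.
    return min(costs) if costs else int(np.sum(x ** 2))

def fitness(X, Z, D):
    """F(D): average over individuals of ||x_i||^2 minus the minimal loss, as in (7)."""
    return float(np.mean([np.sum(np.asarray(x) ** 2) - min_loss(x, Z, D) for x in X]))

Z = all_haplotypes(3)
X = [[0, 0, 1], [0, 1, 0], [0, 1, 1]]   # one xor-genotype per row
f_two = fitness(X, Z, [4, 2])           # dictionary {001, 010}
f_three = fitness(X, Z, [4, 2, 0])      # add the all-zero haplotype
```

With the two-element dictionary only the third individual is reconstructed exactly, giving a fitness of 2/3; adding the all-zero haplotype explains all three and raises the fitness to 4/3, which equals the average squared norm of the xor-genotypes.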

For a given cardinality (sparsity) $n$, the best dictionary 
is therefore given by 

$$\mathcal{D}^* = \arg\max_{|\mathcal{D}| \le n} F(\mathcal{D}), \tag{8}$$

and the sparsest dictionary that is sufficient to reconstruct 
all observed xor-genotypes is determined by 

$$\min \left\{ |\mathcal{D}^*| : F(\mathcal{D}^*) = \frac{1}{N} \sum_{i=1}^{N} \|x_i\|^2 \right\}. \tag{9}$$

Notice that determining both $\mathcal{D}$ for a given $n$ in (8) and 
$A$ for a given $\mathcal{D}$ in (7) are combinatorial problems. In [26], it 
is shown that such combinatorial problems can be solved 
approximately and efficiently by a simple greedy method if 
the objective function satisfies a fundamental property 
called submodularity. In [25], it is shown that the dictionary 
selection problem for (regular) haplotype inference 
has a cost function that is approximately submodular, and 
that a greedy method used to optimize this cost function 
can efficiently find an approximate solution with a 
theoretical guarantee [27]. 

For xor-haplotype inference, on the other hand, the 
problem is fundamentally different: the submodularity 
property may not hold for the cost function in 
(5) due to the XOR operation, and therefore the theoretical 
guarantee for the greedy method does not hold either. 
Nonetheless, we still use a greedy heuristic similar to 
SHSD in [25] in order to maximize the variance reduction 
metric in (5) over the set of observations. 

In our algorithm, Xor Haplotyping by Sparse Dictionary 
Selection (XHSD), we start with an empty dictionary 
$\mathcal{D}_0 = \emptyset$. Then at each iteration $n$, among the consistent 
haplotypes that are not already in the dictionary, i.e., in 
$Z \setminus \mathcal{D}_{n-1}$, we add the haplotype that contributes 
to the dictionary $\mathcal{D}_{n-1}$ with the maximal marginal gain. 
That is, at iteration $n$ the haplotype $h^m \in Z \setminus \mathcal{D}_{n-1}$ is added 
to $\mathcal{D}_{n-1}$ if it satisfies 

$$m = \arg\max_{k \in \{1,\ldots,2^L\} \setminus \mathcal{D}_{n-1}} F(\mathcal{D}_{n-1} \cup \{k\}). \tag{10}$$

Computing (10) requires solving (5) and (6) for each $k$. 
In (6), for each individual $i$, $A_i$ is found by computing the 
Euclidean distance (5) between $x_i$ and the possible reconstructions 
given by the pairwise xor-sums of all columns in 
$Z_{\mathcal{D}}$, and picking the columns that attain the minimum in (6). Whenever 
the indices $A_i$ yield zero in (5), we can explain that individual 
with the corresponding haplotypes in $Z$, i.e., $x_i = 
(Z_{A_i} v_i)_2$. The dictionary $\mathcal{D}_n$ keeps growing until all 
xor-genotypes are explained, i.e., $F(\mathcal{D}_n) = \frac{1}{N} \sum_{i=1}^{N} F_i(\mathcal{D}_n) = 
\frac{1}{N} \sum_{i=1}^{N} \|x_i\|^2$. 

Notice that in the XHSD algorithm the number of compatible 
haplotypes $|Z|$ increases exponentially in comparison 
to the regular haplotyping problem with SHSD. However, 
when regular genotype information is available, we can 
reduce $Z$ accordingly by utilizing it in the 
cost function (5). The necessary modifications are discussed 
in the section XHSD with regular genotypes. 
Another fundamental difference in xor-haplotyping is that 
the xor-genotypes do not provide unambiguous genotype 
information with which one could initialize the dictionary 
(using the corresponding haplotypes) and improve the reconstruction 
accuracy. Nonetheless, with a bias weight, the 
modified cost function can exploit the available regular 
genotypes even when they are not unambiguous. 

Summary of XHSD algorithm: 

• Initialization. 

  - $Z = \{0,1\}^{L \times 2^L}$. 

  - $n \leftarrow 1$. 

  - $\mathcal{D}_{n-1} = \emptyset$. 

• Iterate until all xor-genotypes are explained, i.e., 
  $F(\mathcal{D}_n) = \frac{1}{N} \sum_{i=1}^{N} \|x_i\|^2$. 

  - Perform the greedy search. 

    * For all $j \in \{1,\ldots,2^L\} \setminus \mathcal{D}_{n-1}$, compute 
      $F(\mathcal{D}_{n-1} \cup \{j\})$. 

    * Let $j^* = \arg\max_{j \in \{1,\ldots,2^L\} \setminus \mathcal{D}_{n-1}} F(\mathcal{D}_{n-1} \cup \{j\})$. 
      Set $\mathcal{D}_n = \mathcal{D}_{n-1} \cup \{j^*\}$. 

    * Check if any xor-genotype is 
      explained by the addition of the new 
      element $h^{j^*}$, i.e., if (5) is zero. If so, the 
      inferred haplotype pair for the 
      individual with such an xor-genotype 
      is $\{h^{j^*}, x_i \oplus h^{j^*}\}$. 

  - $n \leftarrow n + 1$. 

Given the xor-genotypes of a set of individuals, this algo- 
rithm finds the haplotypes of each individual based on the 
maximum parsimony principle. 
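
The steps summarized above can be sketched in Python for small $L$. This is a minimal illustration under several assumptions, not the authors' implementation: the function names are hypothetical, the fitness is computed exactly as written in (5)-(7) and left unnormalized (summed over individuals rather than averaged), ties in the greedy step are broken by taking the first maximizer rather than at random, and column indices are 0-based. Intermediate gains and the order in which haplotypes enter the dictionary may therefore differ from other implementations.

```python
import itertools
import numpy as np

def closest_pair(x, Z, D):
    """Squared error and column pair of the best xor-reconstruction of x from D."""
    scored = [(int(np.sum((x - (Z[:, a] ^ Z[:, b])) ** 2)), (a, b))
              for a, b in itertools.combinations_with_replacement(D, 2)]
    # Assumption: an empty dictionary reconstructs the all-zero vector.
    return min(scored) if scored else (int(np.sum(x ** 2)), None)

def total_gain(X, Z, D):
    """N * F(D): variance reduction of (7) summed over all individuals."""
    return sum(int(np.sum(x ** 2)) - closest_pair(x, Z, D)[0] for x in X)

def xhsd(X):
    """Greedy dictionary growth until every xor-genotype (row of X) is explained."""
    X = np.asarray(X)
    L = X.shape[1]
    # All 2^L candidate haplotypes as columns of Z (exponential in L).
    Z = (np.arange(2 ** L)[None, :] >> np.arange(L)[:, None]) & 1
    target = int(np.sum(X ** 2))
    D = []
    while not D or total_gain(X, Z, D) < target:
        rest = [j for j in range(Z.shape[1]) if j not in D]
        gains = [total_gain(X, Z, D + [j]) for j in rest]
        D.append(rest[int(np.argmax(gains))])   # first maximizer on ties
    pairs = [closest_pair(x, Z, D)[1] for x in X]
    return Z, D, pairs

X = [[0, 0, 1], [0, 1, 0], [0, 1, 1]]   # three individuals, three SNPs
Z, D, pairs = xhsd(X)
```

On this toy input the loop stops with a three-haplotype dictionary, and `pairs` gives, for each individual, the two columns of `Z` whose XOR reproduces its xor-genotype. Because every candidate is rescored against every pair of dictionary columns at each step, the sketch is only practical for short sequences.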

As an example, consider the following demonstration. 
Let $x_1$, $x_2$ and $x_3$ be the xor-genotypes of three individuals, 
each corresponding to three SNPs, i.e., 

$$x_1 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}.$$

The set of compatible haplotypes for these individuals 
will consist of all length-3 binary vectors, i.e., 

$$Z = \begin{bmatrix} 0&1&0&1&0&1&0&1 \\ 0&0&1&1&0&0&1&1 \\ 0&0&0&0&1&1&1&1 \end{bmatrix}.$$

After initializing $Z$, and starting with the empty dictionary 
$\mathcal{D}_0$, the algorithm performs the greedy search by adding 
one haplotype from $Z$ (with the maximal marginal gain) 
at a time. At iteration $n = 1$, (10) calculates $m = 
\arg\max(0, 0, 0, 0, 0, 0, 0, 0)$ and $m$ is randomly picked as 5 
among the equal maximum values; the corresponding 
haplotype $Z_5 = [0\,0\,1]^T$ is then added to the dictionary, 
i.e., $\mathcal{D}_1 = [5]$ and $Z_{\mathcal{D}_1} = [0\,0\,1]^T$. Similarly, at $n = 2$, 
$m = \arg\max(0.33, 0, 0.66, 0, 0, 0.33, 0) = 3$ is calculated 
and the haplotype $Z_3 = [0\,1\,0]^T$ is added to the dictionary, 
i.e., $\mathcal{D}_2 = [5, 3]$ and 

$$Z_{\mathcal{D}_2} = \begin{bmatrix} 0&0 \\ 0&1 \\ 1&0 \end{bmatrix}.$$

$x_3$ is explained by the addition of the new haplotype, i.e., $x_3 = [0\,0\,1]^T \oplus [0\,1\,0]^T$, 
yet the other xor-genotypes are not explained. At $n = 3$, 
$m = \arg\max(1.33, 0.66, 0.66, 0.66, 0.66, 1.33, 0.66) = 1$ 
is calculated and the haplotype $Z_1 = [0\,0\,0]^T$ is added to 
the dictionary, i.e., $\mathcal{D}_3 = [5, 3, 1]$ and 

$$Z_{\mathcal{D}_3} = \begin{bmatrix} 0&0&0 \\ 0&1&0 \\ 1&0&0 \end{bmatrix}.$$

The other two xor-genotypes are explained by the new addition, 
i.e., $x_1 = [0\,0\,1]^T \oplus [0\,0\,0]^T$, $x_2 = [0\,1\,0]^T \oplus [0\,0\,0]^T$, and 
the algorithm converges at $n = 3$ via calculating $F(\mathcal{D}_3) = \frac{1}{N} \sum_{i=1}^{N} \|x_i\|^2$. 


This simple example demonstrates how the proposed 
greedy approach can efficiently construct sparse solu- 
tions, where three xor-genotypes are explained by only 
three haplotypes within three iterations. Nonetheless, the 
solution set has the ambiguity of being one of the equiv- 
alent sets of the true solution due to the bit flip degree of 
freedom which should be resolved. 

Resolving bit flip degree of freedom 

In [15] it is shown that the xor perfect phylogeny problem 
can be solved up to bit flipping based on the characteristics 
of the given xor-genotypes. Let $X \in \{0,1\}^{L \times N}$ 
be the xor-genotype matrix of $N$ individuals such that 
$X = [x_1\ x_2 \ldots x_N]$. Denote by $\mathcal{X}_i$ the set of heterozygous 
loci for the $i$-th individual, i.e., $\mathcal{X}_i = \{\ell : x_i(\ell) = 1\}$, 
where $x_i(\ell)$ is the $\ell$-th SNP in $x_i$. If there exists a set 
of individuals $\mathcal{I} \subseteq \{1,\ldots,N\}$ whose xor-genotypes have 
an empty intersection, i.e., $\cap_{i \in \mathcal{I}} \mathcal{X}_i = \emptyset$, then with the knowledge 
of the regular genotypes $G_{\mathcal{I}} \in \{0,1,2\}^{L \times |\mathcal{I}|}$ of those 
individuals one can remove all bit flip degrees of freedom. 
The empty intersection indicates that each SNP will 
be homozygous in at least one of those individuals, 
and therefore that SNP can be resolved by revealing 
the type of allele in the corresponding regular genotype. 
Following this, a post-processing method is suggested 
in [15] that can remove the bit flip degree of freedom 
across the loci where a set of xor-genotypes have an empty 
intersection. 

By bit flipping on a given solution $H$, one attempts to 
choose among the set-equivalent solutions $H'_i$, $i \in 
\{1,\ldots,2^L - 1\}$, and this choice is decided by the given 
regular genotypes (Figure 1). 
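
The idea of choosing among set-equivalent solutions with the help of regular genotypes can be sketched as below. This is only an illustration of the principle, not the post-processing procedure of [15]: the function name `resolve_bit_flips`, the data layout and the toy values are assumptions, and the sketch presumes that the inferred solution already explains every xor-genotype and that the supplied regular genotypes are mutually consistent.

```python
import numpy as np

def resolve_bit_flips(H, pairs, regular):
    """Flip SNPs of an inferred solution so that it agrees with regular genotypes.

    H       : L x K matrix of inferred haplotypes (one per column).
    pairs   : pairs[i] = (a, b), the columns of H phasing individual i.
    regular : dict {i: regular genotype of individual i, coded 0/1/2}.
    """
    H = H.copy()
    for l in range(H.shape[0]):
        for i, g in regular.items():
            if g[l] != 1:                    # individual i is homozygous at SNP l
                a, _ = pairs[i]
                if H[l, a] != g[l] // 2:     # inferred allele disagrees (0 -> 0, 2 -> 1)
                    H[l, :] ^= 1             # flip SNP l in every haplotype
                break                        # one informative genotype per SNP suffices
    return H

# Inferred haplotypes 001, 010, 000 as columns; individuals 0, 1, 2.
H = np.array([[0, 0, 0],
              [0, 1, 0],
              [1, 0, 0]])
pairs = [(0, 2), (1, 2), (0, 1)]
# Hypothetical regular genotypes of individuals 0 and 1; their heterozygous
# loci do not intersect, so together they cover every SNP.
regular = {0: [2, 0, 1], 1: [2, 1, 0]}
resolved = resolve_bit_flips(H, pairs, regular)
```

Here both regular genotypes show a homozygous minor allele at the first SNP, so that SNP is flipped in all haplotypes while the other two SNPs are left unchanged.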

However, this post-processing method has certain limitations. 
Notice that, for large $L$, the set-equivalent solutions 
are highly specific to the inferred $H$, e.g., for 
a given set of xor-genotypes it is very likely that any 
two different inferences $H^1$ and $H^2$ that are not set-equivalent 
have very different set-equivalent solutions. 
Bit flipping on different inferences is therefore likely to lead to 
different results, and the bit flipping accuracy 
largely depends on the initial inference $H$, which is made 
without the prior knowledge on homozygous SNPs, 
i.e., regular genotypes. Besides, utilizing more regular 
genotypes in post-processing, when they are available, does not 
necessarily improve the bit flipping accuracy. Basically, 
deciding among the possible bit flippings for a particular 
locus requires the knowledge of that homozygous 
SNP from a regular genotype. Intuitively, revealing a set 
of homozygous SNPs by employing the smallest number of 
regular genotypes, e.g., as provided by the MTI method, 
is necessary and sufficient for removing the bit flip 
degree of freedom across those SNPs. On the other hand, 
a larger number of regular genotypes will not be any more 
informative due to possible inconsistencies in the type of 
homozygous allele for an SNP site across the given regular 
genotypes. 

Furthermore, notice that flipping the bits at some loci 
across all the haplotypes in $H$ does not affect the parsimony 
of the solution. The final solution $H'$ will have 
the same parsimony as $H$ regardless of the set of loci 
that are flipped. From the maximum parsimony point 
of view, refining an xor-haplotyping solution via the bit flipping 
method does not necessarily lead to the global optimum 
unless the initial inference is set-equivalent to the global 
optimal solution. 

Figure 1 Ambiguity resolution for PPXH method. Informative 
regular genotypes $G_{\mathcal{I}}$ are determined by the MTI algorithm, 
and they are used as control inputs for bit flipping on the initial 
inference result $H$. 

Therefore, instead of using regular genotypes to post-process 
a solution, a more intuitive way could be to 
resolve the bit-flip degree of freedom while constructing 
the solution. In particular, regular genotypes can be 
used as constraints when solving the homozygous sites of 
an xor-genotype. In this sense, given a set of individuals' 
xor-genotypes, we determine the individuals that have the 
most informative regular genotypes and pre-process the 
data set by replacing the xor-genotypes of those 
individuals with their regular genotypes. The MTI algorithm [15] is useful for finding 
the smallest number of such individuals that is adequate 
to reveal the homozygous alleles for each of the $L$ SNPs. 
In the proposed XHSD framework, we employ the MTI 
method to find which individuals' xor-genotypes should be replaced with 
regular genotypes, and after replacing them the new data 
set is presented to the XHSD algorithm (Figure 2). 

In most cases the xor-genotypes in $X$ have an empty intersection 
and for each run MTI outputs 2 or 3 individuals, 
i.e., $|\mathcal{I}| \le 3$; then $G_{\mathcal{I}}$ has at most 3 regular genotypes. 
One can obtain a larger $G_{\mathcal{I}}$ by performing multiple runs of 
MTI with $X$ and collecting the distinct regular genotypes 
given by MTI. 

Next we explain the necessary modifications to the 
XHSD algorithm for utilizing the regular genotypes. 

XHSD with regular genotypes 

The information provided by regular genotypes is used to 
reveal the type of allele at homozygous sites of an individual 
so that we can improve the reconstruction accuracy 
in (5) and build the dictionary $\mathcal{D}$ with more reliable haplotypes. 
That is, when a regular genotype $g_i$ is observed 
for the $i$-th individual, we employ the variance reduction 
metric that is given for regular genotypes such that 

$$L_i(A) = \min_{v_i} \| g_i - Z_A v_i \|^2, \tag{11}$$

where $Z$ is the set of haplotypes that are compatible 
with the $i$-th individual's genotype $g_i$, and $A$ contains the 
indices of the haplotypes in $Z$ that are used to approximate 
$g_i$. In this representation the approximation accuracy is 
potentially higher than for xor-genotypes, 
since the homozygous SNPs in $g_i$ are unambiguous. 
The haplotypes that are used to approximate those 
SNPs will be more reliable candidates when building the 
dictionary $\mathcal{D}$. 

Figure 2 Ambiguity resolution in XHSD or in XOR-HAPLOGEN 
(XHAP). Informative regular genotypes $G_{\mathcal{I}}$ are determined by the 
MTI algorithm, and they are used as inputs to augment the initial data 
set by replacing xor-genotypes of the individuals $\mathcal{I} \subseteq \{1,\ldots,N\}$. 

To exploit this fact, we can introduce a weight $b_i$ in
the cost function $L_i(A)$ so that the algorithm gives
higher priority to the variance reduction of those individuals
that are given by regular genotypes, and the dictionary
will more likely grow with the haplotypes that are
compatible with the given regular genotypes. The biased
variance reduction metric for each individual is then
given by

$$ L_i(A) = \begin{cases}
b_i \min_{v_i} \| g_i - Z_A v_i \|^2 , & \text{given } g_i , \\
\min_{v_i} \| x_i - (Z_A v_i)_2 \|^2 , & \text{given } x_i .
\end{cases} \tag{12} $$

The weight parameter $b_i$ could be set proportional
to the average rate of homozygous SNPs per genotype,
assuming that the more homozygous sites a regular
genotype contains, the more informative it is. We
experimentally set $b_i = 4$, as it yielded good performance
with both synthetic and real databases.
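
As a rough illustration of how the two cases in (12) differ, the following sketch evaluates the weighted residual for one individual by trying every pair of candidate haplotypes. It rests on assumptions that are not stated in this section: haplotypes are 0/1 lists, a regular genotype is coded as the sum of its two haplotypes (entries 0, 1, 2), an xor-genotype is their sum modulo 2, and the indicator vector selects exactly two candidates (possibly the same one twice). The function name is hypothetical.

```python
from itertools import combinations_with_replacement

def weighted_cost(observed, candidates, is_regular, b=4.0):
    """Smallest squared residual over all pairs of candidate haplotypes,
    multiplied by the weight b when the observation is a regular genotype.

    observed   -- regular genotype (entries 0/1/2) or xor-genotype (0/1)
    candidates -- non-empty list of 0/1 haplotypes (those indexed by A)
    is_regular -- True if `observed` is a regular genotype
    """
    best = None
    for h1, h2 in combinations_with_replacement(candidates, 2):
        if is_regular:
            approx = [a + c for a, c in zip(h1, h2)]        # allele counts
        else:
            approx = [(a + c) % 2 for a, c in zip(h1, h2)]  # xor of the pair
        residual = sum((o - p) ** 2 for o, p in zip(observed, approx))
        if best is None or residual < best:
            best = residual
    return (b if is_regular else 1.0) * best

candidates = [[0, 0, 1], [1, 0, 1], [1, 1, 0]]
print(weighted_cost([1, 0, 0], candidates, is_regular=False))  # 0.0
print(weighted_cost([2, 2, 1], candidates, is_regular=True))   # 4.0
```

With b = 4, a regular genotype that is off by one allele costs as much as an xor-genotype that is off at four sites, which is what steers the greedy selection towards haplotypes compatible with the regular genotypes.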

Extensions 

Long xor-genotypes 

Note that the size of Z grows exponentially with the
length L due to the compatibility between haplotypes and
xor-genotypes. That is, finding the solution of a length-L
xor-genotype requires performing the greedy search over
Z, which consists of $2^L$ haplotypes. To mitigate the
computational complexity we employ the partition-ligation
method [28] as in [25], where the block partitioning is
based on identifying the recombination hot spots [29]
existing between the haplotype blocks [30]. After partitioning,
the SNP sequences are divided into blocks such that
within each block the haplotype diversity is as low
as possible.

The haplotype diversity of a given block is measured by
its Shannon entropy. The block partitioning by minimizing
the total Shannon entropy proceeds as follows. Let
$\{\tilde{h}_1^{lm}, \ldots, \tilde{h}_{K_{lm}}^{lm}\}$ be the $K_{lm}$ haplotypes that explain all the
xor-genotypes $x_i^{lm}$, i = 1, ..., N, in the block that starts at
locus l and ends at locus m, i.e., $1 \le l \le m \le L$, and let
$f^{lm} = [f_1^{lm}, \ldots, f_{K_{lm}}^{lm}]$ be the haplotype frequency vector
for this block. Each $f_k^{lm}$, $k = 1, \ldots, K_{lm}$, is represented by
the density of the nonzero values of the indicator vectors
$\{v_1^{lm}, \ldots, v_N^{lm}\}$ for the given block.
The entropy of the haplotype block is then given by

$$ E(l,m) = - \sum_{k=1}^{K_{lm}} f_k^{lm} \log f_k^{lm} , $$

and the total entropy of Q blocks, where each block
$[l_q : m_q]$, q = 1, ..., Q, has an upper bound of length W,
i.e., $m_q - l_q + 1 \le W$, is given by

$$ \mathcal{E} = \sum_{q=1}^{Q} E(l_q, m_q) . $$

To determine the initial and ending loci of each block
$[l_q : m_q]$, q = 1, ..., Q, that minimize $\mathcal{E}$ we use the
recursive method explained in [25], i.e., for each ending
locus $1 \le m \le L$ we determine the block $[l_m^* : m]$,
with $m - l_m^* + 1 \le W$, that contributes the lowest
entropy, and then backtrack the best initial points $l_m^*$ for
each consecutive block by starting with the block $[l_L^* : L]$.
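
The recursion and the backtracking step can be sketched as a standard dynamic program. This is an illustration under stated assumptions rather than the procedure of [25]: the block entropy is passed in as a callable, and the helper `empirical_entropy` (a hypothetical name) computes it from a list of known haplotypes, whereas in the actual method the block haplotypes are themselves inferred.

```python
from collections import Counter
from math import log

def empirical_entropy(haplotypes):
    """Return a function E(l, m) giving the Shannon entropy of the distinct
    sub-haplotypes on loci l..m (1-based, inclusive)."""
    def entropy(l, m):
        counts = Counter(tuple(h[l - 1:m]) for h in haplotypes)
        n = len(haplotypes)
        return -sum(c / n * log(c / n) for c in counts.values())
    return entropy

def partition_blocks(num_loci, max_len, block_entropy):
    """Split loci 1..num_loci into consecutive blocks of length <= max_len
    that minimise the summed block entropy.

    Returns (total entropy, list of (l, m) blocks)."""
    inf = float("inf")
    best = [0.0] + [inf] * num_loci   # best[m]: minimum total for loci 1..m
    start = [0] * (num_loci + 1)      # start[m]: best initial locus for end m
    for m in range(1, num_loci + 1):
        for l in range(max(1, m - max_len + 1), m + 1):
            total = best[l - 1] + block_entropy(l, m)
            if total < best[m]:
                best[m], start[m] = total, l
    # Backtrack from the block that ends at the last locus.
    blocks, m = [], num_loci
    while m > 0:
        blocks.append((start[m], m))
        m = start[m] - 1
    return best[num_loci], blocks[::-1]

haps = [[0, 0, 1, 1, 0], [0, 0, 1, 0, 1], [1, 1, 0, 1, 0], [1, 1, 0, 0, 1]]
total, blocks = partition_blocks(5, 3, empirical_entropy(haps))
print(total, blocks)
```

The run time is proportional to the number of loci times the maximum block length W, times the cost of one entropy evaluation.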

Missing data 

Genotyping errors often occur when the observed geno- 
type of an individual differs from the original sequence 
for various reasons [31,32]. A particular type of geno- 
typing error is the case when some loci are not 
observed/missed during sequencing or other application 
processes. Although methods dealing with some type of 
errors were proposed, often erroneous genotypes are pro- 
duced with significant missing/error rates [33]. Therefore, 
it is of high importance for an xor-haplotyping technique 
to be adaptive for resolving such databases with missing 
sites. We next present a modification to XHSD in order 
to perform xor-haplotyping for the individuals exposed to 
missing data conditions. 

Let $\tilde{g}_i$ be the incomplete genotype of the i-th individual,
where the loci with missing information in $g_i$ are removed.
Similarly, let $\tilde{x}_i$ represent the xor-genotype of the i-th individual
where the missing loci are removed. As the rate of
missing loci increases, the sequences become less informative.
Following the suggestion in [25], we introduce
another weight w(.) to give less weight to the less informative
individuals when evaluating (12) in order to improve
the reliability of haplotype inference, i.e.,

$$ L_i(A) = \begin{cases}
w(\tilde{g}_i) \, b_i \min_{v_i} \| \tilde{g}_i - \tilde{Z}_A^i v_i \|^2 , & \text{given } \tilde{g}_i , \\
w(\tilde{x}_i) \min_{v_i} \| \tilde{x}_i - (\tilde{Z}_A^i v_i)_2 \|^2 , & \text{given } \tilde{x}_i ,
\end{cases} \tag{13} $$

where $\tilde{Z}^i$ is the matrix Z with the rows corresponding
to the missing loci of the i-th individual removed. The
weight is selected as a nondecreasing function of the total
information content in the sequence such that

$$ w(\tilde{x}_i) = \dim(\tilde{x}_i)^2 , \tag{14} $$

where $\dim(\tilde{x}_i)$ gives the dimension of $\tilde{x}_i$.



Different weight functions could be employed to exploit 
the distribution of missing sites. Since, in our experi- 
ments, the missing sites are uniformly distributed across 
the SNPs and individuals the function in (14) gave a good 
performance. 
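
The row removal and the weight in (14) amount to the following small sketch. The marker used for a missing locus (`None`) and the function names are assumptions made for this illustration, and the candidate matrix is stored as a list of rows with one row per locus.

```python
MISSING = None  # hypothetical marker for an unobserved locus

def drop_missing(sequence, candidate_rows):
    """Remove the missing loci from a sequence together with the matching
    rows of the candidate matrix (one row per locus)."""
    kept = [k for k, value in enumerate(sequence) if value is not MISSING]
    return [sequence[k] for k in kept], [candidate_rows[k] for k in kept]

def weight(reduced_sequence):
    """Weight of (14): the squared number of observed loci."""
    return len(reduced_sequence) ** 2

x = [1, MISSING, 0, MISSING, 1]
Z = [[0, 1], [1, 1], [0, 0], [1, 0], [1, 1]]
x_red, Z_red = drop_missing(x, Z)
print(x_red, Z_red, weight(x_red))  # [1, 0, 1] [[0, 1], [0, 0], [1, 1]] 9
```

Because the weight grows quadratically with the number of observed loci, an individual with half of its sites missing contributes only a quarter of the weight of a fully observed one.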

The proposed method does not account for the direct 
inference of the missing sites, i.e., imputing missing geno- 
types [34]. However, the missing values in each xor- 
genotype can be recovered from the solution by simply 
looking at the haplotype pairs which are specifically 
inferred for each individual. Since the proposed method 
has robust performance against missing data, as presented 
in the next section, the inferred solution will be sufficient 
to type missing genotype sites. An implementation of the 
proposed method -with aforementioned extensions- is 
provided in "Additional file 1". 
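
Reading the missing xor-genotype values off an inferred haplotype pair, as described above, can be sketched in a few lines. The encoding (0/1 haplotypes, `None` for a missing site) and the function name are assumptions of this illustration and are not taken from Additional file 1.

```python
def fill_missing_xor(xor_genotype, hap1, hap2, missing=None):
    """Replace each missing xor-genotype entry by the xor of the two
    haplotypes inferred for that individual; observed entries are kept."""
    return [a ^ b if x is missing else x
            for x, a, b in zip(xor_genotype, hap1, hap2)]

print(fill_missing_xor([1, None, 0], [1, 0, 1], [0, 1, 1]))  # [1, 1, 0]
```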

Results and discussion 

We tested the performance of several xor-haplotyping
methods with a number of metrics. First we measured the
probability of error ($P_e$), i.e., the percentage of individuals
whose inferred pair of haplotypes is different from
the original pair. This measure is sensible for assessing
the inference quality in the regular haplotyping problem, since
the alleles corresponding to homozygous loci are known
and only the heterozygous loci are ambiguous; performance
therefore depends on the inference accuracy on heterozygous
loci. In xor-haplotyping, however, there are
a large number of solutions equivalent to the original one
up to bit flipping, and it is therefore very likely that a
solution set differs from the original phasing on at least
one SNP. In particular, for a given xor-genotype, even if
there is a single SNP difference (namely a bit flip) between
the pair of inferred haplotypes and the pair of haplotypes
that originally gave rise to that xor-genotype, it is
counted as a mis-inference. A more sensible metric, therefore,
would take into account the percentage of such SNPs
where the inference differs from the true phasing. In that
sense, the switch error rate (swr) [35] is a proper metric
that counts the minimum number of switches required
for heterozygous loci to change to the correct alleles of
the original haplotypes. It gives a sense of how closely
the inference was made, i.e., as the ratio of the total number of
mis-inferred heterozygous loci $\mathrm{miss}_i^{het}$ in all individuals
i = 1, ..., N to the worst-case number of switches (half of
the number of heterozygous loci in each individual, $x_i^{het}/2$),
i.e.,

$$ swr = \frac{\sum_{i=1}^{N} \mathrm{miss}_i^{het}}{\sum_{i=1}^{N} x_i^{het}/2} . $$

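Moreover, to assess the accuracy on homozygous sites,
we employ the prediction error rate ($err_p$) [23], computed as
the fraction of incorrectly predicted hidden-homozygous
sites out of all hidden-homozygous sites, i.e.,

$$ err_p = \frac{\sum_{i=1}^{N} \mathrm{miss}_i^{hom}}{\sum_{i=1}^{N} x_i^{hom}} , $$

where $\mathrm{miss}_i^{hom}$ and $x_i^{hom}$ denote the number of incorrectly
predicted and the total number of hidden-homozygous sites of the
i-th individual, respectively.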
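
One possible reading of the three metrics is sketched below. It assumes that haplotypes are 0/1 lists and that the inferred pair of each individual is heterozygous at exactly the same loci as the true pair (which holds when both explain the same xor-genotype). The per-individual count of mis-inferred heterozygous loci is taken over the better of the two pair orientations, so that it never exceeds half of the heterozygous loci. The function name is hypothetical.

```python
def evaluate(true_pairs, inferred_pairs):
    """Return (P_e, swr, err_p) for two lists of haplotype pairs, where each
    pair is a tuple of two equal-length 0/1 lists."""
    wrong = miss_het = total_het = miss_hom = total_hom = 0
    for (t1, t2), (h1, h2) in zip(true_pairs, inferred_pairs):
        # P_e: the unordered pair must match exactly.
        if sorted([t1, t2]) != sorted([h1, h2]):
            wrong += 1
        het = [k for k in range(len(t1)) if t1[k] != t2[k]]
        hom = [k for k in range(len(t1)) if t1[k] == t2[k]]
        mismatches = sum(h1[k] != t1[k] for k in het)
        # The pair is unordered, so take the better of the two orientations.
        miss_het += min(mismatches, len(het) - mismatches)
        total_het += len(het)
        # At homozygous loci both inferred haplotypes carry the same allele.
        miss_hom += sum(h1[k] != t1[k] for k in hom)
        total_hom += len(hom)
    p_e = wrong / len(true_pairs)
    swr = miss_het / (total_het / 2) if total_het else 0.0
    err_p = miss_hom / total_hom if total_hom else 0.0
    return p_e, swr, err_p

truth = [([0, 0, 1, 1], [0, 1, 0, 1]), ([1, 1, 0, 0], [1, 1, 0, 0])]
guess = [([0, 0, 0, 1], [0, 1, 1, 1]), ([1, 0, 0, 0], [1, 0, 0, 0])]
print(evaluate(truth, guess))
```

In the toy input both individuals are counted as errors by $P_e$, although only one heterozygous and one homozygous site are wrong, which is the motivation given above for reporting swr and $err_p$ alongside $P_e$.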



We performed xor-haplotyping on various data sets, 
with and without missing information on loci: synthetic 
data sets with different recombination rates simulated by 
a coalescence based program of [36], a database consisting 
of the SNPs in the CFTR gene that is associated with cystic 
fibrosis (CF) disorder [37], and another database (ANRIL) 
containing the SNPs that have relatively lower linkage 
disequilibrium (high polymorphism). We tested differ- 
ent xor-haplotyping methods that are based on different 
assumptions including the parsimony graph realization 
model PPXH [24], the parsimony genetic search model 
XOR-HAPLOGEN (XHAP) [23], the graph representa- 
tion model GREAL [15], and an integer programming 
approach Poly-IP [38]. Among the four methods the last 
two were ineffective for practical reasons. GREAL failed 
at finding solutions for data sets with reasonably long 
sequences (SNPs > 30), and Poly-IP method is often 
computationally inefficient when solving even a simple 
problem (e.g., it takes more than 24 hours to solve a set of 
50 individuals with 30 SNPs). 

Synthetic data 

Based on different recombination rates, three different
scenarios are considered in the synthetic data sets: no recombination
(r = 0), and recombination with rates r = 4
and r = 40, respectively. The recombination rate is
the rate at which the haplotypes of an individual exchange
sequence fragments due to events such as
crossing-over. This is simulated by the model
given in Hudson's software [36]. For each scenario we generated
100 different data sets by random pairing of a set
of simulated haplotypes of different lengths ($5 \le L \le 46$)
for a given population size. This is repeated for different
population sizes as well, $N \in \{10, 20, 30, 40, 50\}$.
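
The random pairing step can be sketched as follows. The haplotypes themselves would come from the coalescent simulator [36]; here they are simply given as 0/1 lists. Drawing the two haplotypes independently and with replacement is an assumption of this sketch, and the function name is hypothetical.

```python
import random

def random_xor_genotypes(haplotypes, num_individuals, seed=None):
    """Pair haplotypes at random and return (pairs, xor-genotypes)."""
    rng = random.Random(seed)
    pairs, xors = [], []
    for _ in range(num_individuals):
        h1, h2 = rng.choice(haplotypes), rng.choice(haplotypes)
        pairs.append((h1, h2))
        # A site is heterozygous (1) exactly where the two haplotypes differ.
        xors.append([a ^ b for a, b in zip(h1, h2)])
    return pairs, xors

pool = [[0, 0, 1, 1], [0, 1, 0, 1], [1, 1, 0, 0]]
pairs, xors = random_xor_genotypes(pool, num_individuals=5, seed=1)
print(xors)
```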

Figure 3 displays the performances of the different methods on
short data sets ($L \le 14$), based only
on xor-genotypes. The quality of inference is
determined after removing all bit flip degrees of freedom
by looking for the best equivalent set of a particular inference,
i.e., performing an exhaustive search to find the
bit flipping that gives the result closest to the true phasing of the
xor-genotypes. Such an evaluation shows the best inference
performance of the different methods without the help of regular
genotypes. Compared to the other methods, XHSD can
potentially resolve a set of xor-genotypes with comparably
low error rates. Moreover, XHSD achieves the lowest
switch error rates, especially for large data sets, indicating
a better accuracy (i.e., similarity with the true haplotypes)
for the initial inference given only the xor-genotypes.
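
The exhaustive search over bit flips can be sketched as below, which is feasible for short sequences because there are only $2^L$ flip masks. The sketch assumes that a bit flip complements one SNP in the haplotypes of all individuals at once (which leaves every xor-genotype unchanged) and that closeness is measured by the number of individuals whose pair is wrong; both the criterion and the function name are choices made for this illustration.

```python
from itertools import product

def best_bit_flip(true_pairs, inferred_pairs):
    """Try every flip mask and return (fewest wrong individuals, mask)."""
    num_snps = len(true_pairs[0][0])
    truth = [sorted(pair) for pair in true_pairs]
    best_errors, best_mask = len(true_pairs) + 1, None
    for mask in product((0, 1), repeat=num_snps):
        errors = 0
        for target, (h1, h2) in zip(truth, inferred_pairs):
            # Apply the same mask to both haplotypes of the individual.
            flipped = sorted([[a ^ m for a, m in zip(h, mask)]
                              for h in (h1, h2)])
            errors += flipped != target
        if errors < best_errors:
            best_errors, best_mask = errors, mask
    return best_errors, best_mask

truth = [([0, 1, 1], [1, 1, 0]), ([0, 1, 1], [0, 1, 1])]
guess = [([0, 0, 1], [1, 0, 0]), ([0, 0, 1], [0, 0, 1])]
print(best_bit_flip(truth, guess))  # (0, (0, 1, 0))
```

Here the inference is wrong for both individuals as it stands, yet flipping the second SNP everywhere recovers the true phasing exactly, so the inference is a set-equivalent of the true solution.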

To evaluate the inference quality when regular genotype
data are available, we first determined only a limited
number of regular genotypes by the MTI method, i.e.,
the smallest set of regular genotypes that have empty
intersection on the heterozygous SNPs, then resolved the
ambiguity by bit flipping on the initial inference according
to these regular genotypes (Figure 4). This test evaluates
how the methods deal with the bit-flip degree of freedom
under very limited regular genotype data that, in theory,
suffice to resolve all SNPs. Given the long xor-genotype
data sets ($5 \le L \le 46$), block partitioning is applied in
XHSD by limiting the maximum block size to W = 8
SNPs. From Figure 4, we can say that XHSD has the best
potential to make an inference with high accuracy when
the regular genotypes are introduced. We also applied
the proposed XHSD framework represented in Figure 2
to the same data set, where 2 xor-genotypes are replaced
with the regular genotypes. Note that the Proposed XHSD
achieves a significant decrease in $P_e$ rates despite the small
augmentation of the data by only 2 regular genotypes, compared
to using them in the post-processing, i.e., XHSD
(bit flipping).

Figure 3 Potential inference quality on short ($L \le 14$) synthetic data.

Figure 4 Performance on long ($5 \le L \le 46$) synthetic data by bit flipping via 2 regular genotypes.

It is worth noting that the algorithms based on
segmentation may deteriorate when processing long xor-genotype
sequences, especially with increasing recombination
rates where the detection of haplotype blocks is
complicated [39]. We used block partitioning (segmentation)
in XHSD to reduce complexity when processing long
xor-genotype sequences. In Figure 4 the segmentation
effect is noticeable particularly at very high recombination
rates, i.e., r = 40. However, in the general scenario, i.e., $r \le 4$,
we can say that the segmentation effect is not significant
for the proposed method's performance, and it outperforms
XOR-HAPLOGEN in most data sets containing
typical recombination rates.

For more practical results we provided each method with regular
genotypes for different percentages of the population
and allowed the methods to remove the ambiguity on
their own, except for PPXH. Since PPXH cannot make use
of regular genotypes directly, we applied bit flipping using
the MTI solver to remove ambiguity for this method. To
regularly genotype a given percentage of the population,
the regular genotypes are determined by running the MTI
method several times until the number of distinct regular
genotypes obtained reaches the given percentage of the
total number of individuals.



Figure 5 shows the performances on the synthetic data of
a large population of 50 individuals with zero recombination
rate, where the fraction of the population given by
regular genotypes ranges from 10% (5 individuals) to 100%
(50 individuals). XHSD outperforms the
other methods in almost all cases. In particular, after 20% of
the population is given by regular genotypes, XHSD can
immediately utilize the regular genotypes and significantly
improve the accuracy on both homozygous ($err_p$) and
heterozygous sites (swr). We can conclude that the parsimony
principle of the XHSD method is well-suited for
inferring the heterozygous sites, and for predicting the
homozygous sites it usually suffices to have a small percentage
of regular genotypes.

Figure 5 Performance on long ($5 \le L \le 46$) synthetic data from
50 individuals by employing different numbers of regular
genotypes.

Missing data 

We investigated how well the various methods deal with missing
data under different circumstances. Since
the methods performed similarly under zero recombination
rate, we used the same data sets with no recombination
to generate the database with missing entries. A SNP
site of an individual is defined as "missing" with a probability
of $P_{miss}$, and the data sets for different percentages of
missing SNPs are generated accordingly. The PPXH method is
excluded since it cannot handle missing data. In XHSD the
block partitioning is applied as before with a maximum
block size of W = 8 SNPs.
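
The generation of missing entries can be sketched as an independent draw per site. The marker for a missing site (`None`) and the function name are assumptions of this illustration.

```python
import random

def mask_sites(genotypes, p_miss, seed=None, missing=None):
    """Mark each site of each individual as missing, independently,
    with probability p_miss."""
    rng = random.Random(seed)
    return [[missing if rng.random() < p_miss else value for value in row]
            for row in genotypes]

data = [[1, 0, 1, 0], [0, 0, 1, 1]]
print(mask_sites(data, p_miss=0.05, seed=0))
```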

Figures 6 and 7 show the performances in different
scenarios of partial regular genotyping under different
rates of missing data. As in the previous plots, each
point represents the average value of the corresponding
metric over 100 realizations, i.e., 100 different sets with
the number of SNPs varying between 5 and 46. In most cases,
XOR-HAPLOGEN and XHSD are insensitive to the
increased number of missing sites. XOR-HAPLOGEN is
more accurate for small groups of individuals. Nonetheless,
when more individuals are available in the database
(N > 30) XHSD displays a better performance in all
circumstances.

We examined the dependency of the methods on the
missing data rate for a population with a large
number of individuals. That is, we used the xor-genotypes
from 50 individuals, replaced 30% and 50% of the
population with regular genotypes, and performed xor-haplotype
inference under different missing data rates
ranging from 0.5% to 5%. As seen in Figure 8, both
methods are robust against missing data. On the other
hand, XHSD is less dependent on regular genotypes and
it can achieve better error rates than XOR-HAPLOGEN
while employing even fewer regular genotypes.
XOR-HAPLOGEN needs approximately 20% more regular
genotypes to reach the same $P_e$ level as XHSD, e.g.,
regular genotyping by 30% in XHSD is comparable to that
of 50% in XOR-HAPLOGEN.



CFTR gene database 

Cystic fibrosis (CF) is an autosomal recessive disorder
caused by mutations in the gene that encodes the
cystic fibrosis transmembrane conductance regulator protein
(CFTR). In [37], various mutations on 23 polymorphic
locations on chromosome 7 are detected as
the disease loci for CF. We used this database, corresponding
to 29 distinct haplotypes, to generate random
xor-genotypes. By combining the haplotype pairs at random
we generated the xor-genotypes for a given number
of individuals N, and repeated the process for different
population sizes, i.e., $N \in \{100, 200, 300, 400\}$. In this
database, the data sets with a small number of individuals
present high haplotype diversities, i.e., many of the
distinct haplotypes are only used once in the generation
of individuals. Therefore, the larger data sets that have
low haplotype diversities are expected to be solved with
higher accuracy by biologically-oriented methods, such as
XOR-HAPLOGEN, which obtains its inference according
to a multi-locus linkage disequilibrium (LD)-based block
identification model.

Figure 6 Performance under low rates of missing data, long ($5 \le L \le 46$) synthetic data.

Figure 7 Performance under high rates of missing data, long ($5 \le L \le 46$) synthetic data.



We tested the performance of each method on
this database with and without missing sites (missing rates
of 0 and 5%). The PPXH
method was excluded from the missing data analysis
since it cannot deal with missing data. XHSD is applied
with block partitioning and a maximum block length
of W = 8 SNPs as before. It is seen in Figure 9 that
XHSD outperforms the other methods for various population
sizes, with significantly lower error rates. As the xor-genotypes are
taken from more individuals, the inference accuracy
immediately improves in XHSD and XOR-HAPLOGEN,
whereas PPXH does not benefit from
the additional data.

Figure 10 shows the average running times of each
method on this database. It is observed
that XHSD maintains similar computational complexity as
the size of the data set grows, and it shows running times
comparable with those of XOR-HAPLOGEN. Although
PPXH performs significantly faster, it cannot mitigate
the high error rates and is not able to provide accurate
inferences.

Figure 8 Performance under different percentages of missing data.

Figure 9 Performance on CFTR gene database with different population sizes with/without missing data.



Typing errors 

Combinatorial optimization techniques are known for
their sensitivity to genotyping errors [40]. We therefore
tested the effect of typing errors on the proposed algorithm
using the CFTR gene database. We defined a SNP site of
an individual as erroneous with a probability of $P_{err}$, and
typed the site as either homozygous or heterozygous with
equal probabilities. We then ran the algorithms without
providing the knowledge of erroneous site positions. We
excluded the PPXH method due to its low performance on the
CFTR database. Figure 11 illustrates the algorithms' performance
under typing errors with $P_{err} = 2\%$. It is seen that
XOR-HAPLOGEN is more robust against typing
errors because of its statistical nature. Nonetheless, the
proposed XHSD algorithm can deal with erroneous data
containing ~2% typing errors, with a small increase in the
error rates compared to the results without typing errors.

ANRIL database 

The performance of haplotyping methods can deteriorate
on databases with decreasing linkage disequilibrium (LD)
rates. A SNP database with low pairwise-LD scores is
investigated in an association study given in [41] for their
susceptibility to certain types of leukemia. This database
includes 16 SNPs from the chromosome 9p21 associated
with several diseases and a SNP locus encoding for
anti-sense non-coding RNA in the INK4 locus (ANRIL)
[42]. We used the corresponding haplotype data from the
HAPMAP database (http://hapmap.ncbi.nlm.nih.gov/)
collected from 90 European individuals. We generated the
xor-genotypes for the individuals by using their haplotype
pairs and tested the algorithms on this database. It is seen
from Figure 12 that the algorithms deteriorate when
inferring the haplotypes with low-LD SNPs. XHSD shows
very similar performance to XOR-HAPLOGEN, and
both methods outperform PPXH on this database.

Notice that the algorithms cannot mitigate the error
rates with an increasing number of individuals. This can be
explained by the occurrence of very high haplotype diversity
in the corresponding low-LD SNP regions. The number
of distinct haplotypes explaining the given number
of individuals presumably remains at high diversity as
the number of individuals grows, whereas the methods
based on the maximum parsimony principle fail to incorporate
this fact. They tend to find parsimonious
(low-diversity) solutions in all population sizes, with a
decreasing ratio (p) of "total number of distinct haplotypes
explaining the given set of individuals" to "total
number of given individuals" as the population size
grows. It is worth noting that, in the XHSD results in
Figure 12 ($P_{miss} = 0$), we observed that this ratio decreases
as p = [1.3, 0.95, 0.83, 0.72, 0.66] for the populations
with 10, 20, 30, 40, 50 individuals, whereas the
same ratio for the true phasing (ground truth data) is
in fact much higher, i.e., p = [1.7, 1.48, 1.34, 1.27, 1.24],
respectively, thereby causing the parsimony-based haplotyping
methods to deteriorate on this database. On the
other hand, in the high-LD CFTR database, the same ratio
for the true phasing is very low due to low haplotype
diversity, i.e., p = [0.29, 0.14, 0.1, 0.07] for the
populations with 100, 200, 300, 400 individuals, and the
XHSD method achieves very similar rates,
i.e., p = [0.43, 0.15, 0.1, 0.07], respectively.

Figure 11 Performance on CFTR gene database for different population sizes, with $P_{err}$ = 2%, with/without missing data.
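
The diversity ratio p used in this comparison is straightforward to compute from a phasing. The sketch below assumes haplotypes are 0/1 lists and one pair per individual; the function name is hypothetical.

```python
def diversity_ratio(pairs):
    """Number of distinct haplotypes divided by the number of individuals."""
    distinct = {tuple(h) for pair in pairs for h in pair}
    return len(distinct) / len(pairs)

phasing = [([0, 1], [1, 1]), ([0, 1], [0, 1])]
print(diversity_ratio(phasing))  # 1.0
```

The ratio lies between 1/N and 2 for N individuals, so values such as 1.7 for the true phasing of 10 individuals indicate that nearly every haplotype in the sample is distinct.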



Conclusions 

In this paper, we have presented a new xor-haplotyping
method, XHSD, based on the maximum parsimony principle,
that infers the haplotype pairs for each member of
a group of unrelated individuals by observing their xor-genotypes.
A dictionary selection method is utilized to
find the smallest set of haplotypes selected from a candidate
set that can explain the given set of xor-genotypes.
The proposed approach requires regular genotypes from
only a small percentage of individuals for the removal of
ambiguity across all SNPs of the inferred haplotypes. The
smallest subgroup of individuals having the most informative
regular genotypes is efficiently determined by
the minimum tree intersection algorithm. Although the
inference accuracy was proportional to the percentage of
the individuals given by regular genotypes, XHSD shows
less dependency on regular genotypes compared to other
methods. Experimental results have demonstrated that
XHSD is a reliable method for xor-haplotyping under all
tested circumstances, including missing data and typing error
cases. Low rates of missing values (< 10%) in the xor-genotypes
often have an insignificant contribution to the error
rates, and the proposed method can deal with ~2% typing
errors. Particularly for large databases, XHSD produces
the most accurate solution, with significantly lower error
rates compared to other low-complexity xor-haplotyping
methods. Experiments with the CFTR gene database also
showed that our approach can perform effectively on real
data sets with/without missing sites. Another database
with particularly lower LD rates indicates that the proposed
algorithm can achieve performance on par with
the state-of-the-art algorithms. We expect that XHSD
can serve as a practical tool for xor-haplotyping on
real-world large instances, as large data collections
become more available in the era of next-generation DNA
sequencing.

Additional file 1: Matlab implementation. This file includes the Matlab
code of the proposed algorithm, and an implementation with the example
database, CFTR.